The dataset is a sample of the transactions made in a retail store. The store wants to better understand customer purchase behaviour across different products. Specifically, this is a regression problem: we try to predict the dependent variable (the purchase amount) from the information contained in the other variables.
Classification problems can also be posed on this dataset, since several variables are categorical. Examples include predicting the age group of the consumer or the category of goods bought. The dataset also lends itself to clustering, which could reveal distinct groups of consumers within it.
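As a rough illustration of the regression framing, the sketch below fits a linear model with `Purchase` as the dependent variable. It is not the analysis itself: the data frame `sample_rows` is a hypothetical stand-in built by hand from the first six rows of the data preview in this document, and `Gender` is picked as the only predictor purely for illustration. Six rows are far too few for a meaningful model, so only the shape of the call matters here.

```r
# Hypothetical stand-in: two columns from the first six rows of the preview.
sample_rows <- data.frame(
  Gender   = factor(c("F", "F", "F", "F", "M", "M")),
  Purchase = c(8370, 15200, 1422, 1057, 7969, 15227)
)

# Regression framing: Purchase is the dependent variable,
# the remaining variables (here only Gender) are predictors.
fit <- lm(Purchase ~ Gender, data = sample_rows)

# With one binary predictor, the intercept is the mean purchase of the
# reference level (F) and GenderM is the difference in means (M minus F).
print(coef(fit))
```

On the full dataset the formula would typically include more of the other variables, and the classification and clustering ideas would use a different response or no response at all.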
Several R packages are useful for analyzing this dataset.
This dataset has 12 variables:
##   User_ID Product_ID Gender   Age Occupation City_Category
## 1 1000001  P00069042      F  0-17         10             A
## 2 1000001  P00248942      F  0-17         10             A
## 3 1000001  P00087842      F  0-17         10             A
## 4 1000001  P00085442      F  0-17         10             A
## 5 1000002  P00285442      M   55+         16             C
## 6 1000003  P00193542      M 26-35         15             A
##   Stay_In_Current_City_Years Marital_Status Product_Category_1
## 1                          2              0                  3
## 2                          2              0                  1
## 3                          2              0                 12
## 4                          2              0                 12
## 5                         4+              0                  8
## 6                          3              0                  1
##   Product_Category_2 Product_Category_3 Purchase
## 1                 NA                 NA     8370
## 2                  6                 14    15200
## 3                 NA                 NA     1422
## 4                 14                 NA     1057
## 5                 NA                 NA     7969
## 6                  2                 NA    15227
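A preview like the one above is what `head()` prints for a data frame. The following is a minimal sketch of how it could be produced. The object name `dataset` and the file name `BlackFriday.csv` are assumptions, since the source names neither. To keep the example runnable without the file, the six rows shown above are rebuilt by hand.

```r
# Assumed file name; the source does not say where the data is stored.
# dataset <- read.csv("BlackFriday.csv", stringsAsFactors = TRUE)

# Hand-built stand-in containing only the six rows printed above.
dataset <- data.frame(
  User_ID = c(1000001L, 1000001L, 1000001L, 1000001L, 1000002L, 1000003L),
  Product_ID = factor(c("P00069042", "P00248942", "P00087842",
                        "P00085442", "P00285442", "P00193542")),
  Gender = factor(c("F", "F", "F", "F", "M", "M")),
  Age = factor(c("0-17", "0-17", "0-17", "0-17", "55+", "26-35")),
  Occupation = c(10L, 10L, 10L, 10L, 16L, 15L),
  City_Category = factor(c("A", "A", "A", "A", "C", "A")),
  Stay_In_Current_City_Years = factor(c("2", "2", "2", "2", "4+", "3")),
  Marital_Status = c(0L, 0L, 0L, 0L, 0L, 0L),
  Product_Category_1 = c(3L, 1L, 12L, 12L, 8L, 1L),
  Product_Category_2 = c(NA, 6L, NA, 14L, NA, 2L),
  Product_Category_3 = c(NA, 14L, NA, NA, NA, NA),
  Purchase = c(8370L, 15200L, 1422L, 1057L, 7969L, 15227L)
)

# Print the first six rows (the default for head()).
head(dataset)
```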
## Observations: 537,577
## Variables: 12
## $ User_ID                    <int> 1000001, 1000001, 1000001, 1000001, 1…
## $ Product_ID                 <fct> P00069042, P00248942, P00087842, P000…
## $ Gender                     <fct> F, F, F, F, M, M, M, M, M, M, M, M, M…
## $ Age                        <fct> 0-17, 0-17, 0-17, 0-17, 55+, 26-35, 4…
## $ Occupation                 <int> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, …
## $ City_Category              <fct> A, A, A, A, C, A, B, B, B, A, A, A, A…
## $ Stay_In_Current_City_Years <fct> 2, 2, 2, 2, 4+, 3, 2, 2, 2, 1, 1, 1, …
## $ Marital_Status             <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1…
## $ Product_Category_1         <int> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8,…
## $ Product_Category_2         <int> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, …
## $ Product_Category_3         <int> NA, 14, NA, NA, NA, NA, 17, NA, NA, N…
## $ Purchase                   <int> 8370, 15200, 1422, 1057, 7969, 15227,…
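The structure listing above looks like the output of `glimpse()` from dplyr, and it shows that `Product_Category_2` and `Product_Category_3` contain `NA` values. The sketch below shows a dependency-free way to get a similar overview with base R's `str()` and to count missing values per column. The data frame `sample_rows` is a hypothetical stand-in holding four of the twelve columns for the first six rows only, so the counts apply to that sample and not to the full 537,577 observations.

```r
# Hypothetical stand-in: four columns, first six rows of the dataset.
sample_rows <- data.frame(
  Product_Category_1 = c(3L, 1L, 12L, 12L, 8L, 1L),
  Product_Category_2 = c(NA, 6L, NA, 14L, NA, 2L),
  Product_Category_3 = c(NA, 14L, NA, NA, NA, NA),
  Purchase           = c(8370L, 15200L, 1422L, 1057L, 7969L, 15227L)
)

# Base R overview of column types and leading values.
str(sample_rows)

# Count missing values in each column.
na_counts <- colSums(is.na(sample_rows))
print(na_counts)
```

In this six-row sample, `Product_Category_2` has 3 missing values and `Product_Category_3` has 5, while the other two columns have none.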
## [1] "5891 buyers registered at Black Friday"
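A count like the one above can be obtained by counting distinct values of `User_ID`, since each buyer appears once per transaction. This is a minimal sketch and not necessarily the code that produced the line above. The vector `user_ids` is a hypothetical stand-in holding the `User_ID` values of the first six rows only.

```r
# Hypothetical stand-in: User_ID values from the first six rows.
user_ids <- c(1000001L, 1000001L, 1000001L, 1000001L, 1000002L, 1000003L)

# Each buyer can appear in many transactions, so count distinct IDs.
n_buyers <- length(unique(user_ids))

buyers_text <- paste(n_buyers, "buyers registered at Black Friday")
print(buyers_text)
# [1] "3 buyers registered at Black Friday"
```

Applied to the `User_ID` column of the full dataset, the same computation would be expected to give the 5891 reported above.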